Submission for Fundamentals of Social Data Science

2020.12.09
Candidate number: 1047951
Word Count: 3499

Part 1. Exploring the Relationship Between Rural Areas and COVID-19

Introduction

The purpose of this essay is to analyze how the spread and death rate of COVID-19 relate to the share of a country's population living in rural areas. Our world is characterized by heterogeneity in urban development; understanding whether case and death outcomes differ between rural and urban regions can help inform present policy decision-making and aid in the formation of pre-emptive policy measures in the future.

Rural areas are territories with low population density; however, it is not feasible to narrow down the definition since it varies by country (World Bank, 2020). This is elaborated in the limitations section together with its implications for the research.

COVID-19 spread and rural areas

COVID-19 might spread either faster or slower in rural areas. The spread might be faster because of factors such as weaker law enforcement (e.g. NRCN, 2015), which enables greater free-riding; inferior sanitary conditions (e.g. Chaudhuri & Roy, 2017), given that sanitation has been noted as essential for preventing COVID-19 (WHO, 2020); slower information dissemination practices (Salim, 2013); and lower adoption and efficiency of contact tracing technology (e.g. Pew Research Center, 2019).

COVID-19 might also spread slower in rural areas because they see less inbound travel and mobility, both of which have been associated with increased COVID-19 cases in urban areas (Badr et al., 2020). Furthermore, the lower population density of rural areas makes close contact between people less likely, ceteris paribus. Given the ambiguous relationship, I hypothesize that:

H1: There is no observable relationship between the number of COVID-19 cases and the percentage of the rural population in a country.

COVID-19 death and rural areas

In this essay, I operationalize the death rate as deaths per confirmed case, as opposed to deaths per capita, to control for the number of cases in my analysis.

The death rate might be higher in rural areas because there are fewer and inferior medical facilities. There are often fewer doctors per capita in rural areas (OECD, 2015), with more obstacles to overcome to deliver comparable healthcare (Palmer et al., 2019). Furthermore, rural dwellers tend to be older (e.g. DEFRA, 2018), exposing them to greater risk from COVID-19. However, they might have lower death rates due to better health conditions (Spasojevic et al., 2015). I argue that fewer and inferior medical facilities are the predominant factors and that they contribute to higher death rates in rural areas.

H2: The COVID-19 death rate increases as a country's relative rural area increases.

A summary of the relationships is provided below.

COVID-19 and rural areas

Methodology

In this analysis, I do not control for or link any of the aforementioned factors but, rather, visually inspect rural population shares alongside COVID-19 deaths and cases to find patterns in the relationship; the theoretical links described above can aid in explaining the mechanisms behind it. Daily, as opposed to cumulative, numbers for cases and deaths are used to capture the dynamic relationship. Variables are aggregated to different time units where appropriate.

I adjust the variables to a per capita basis to control for population size. The World Bank did not have data for 30 countries in the initial OXCOVID19 data, so they were dropped. Furthermore, 63 countries had negative daily COVID-19 case counts (calculated as the difference in cases between two consecutive days). Excluding all these countries would unduly narrow the scope of the analysis, so only the countries with a cumulative drop of more than 5% of total cases were removed and the other countries were kept. Four countries were thus removed from the analysis due to unreliable data.
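The exclusion rule can be illustrated with a minimal sketch. It assumes that the "cumulative drop" is the sum of all day-to-day decreases in a country's cumulative case series, expressed as a share of the final cumulative total; the function names and the example series are hypothetical and are not taken from the OXCOVID19 data.

```python
def cumulative_drop_share(cumulative):
    """Sum of all day-to-day decreases as a share of the final cumulative total.

    Assumption: this is one plausible reading of 'cumulative drop of more
    than 5% of total cases'; the original analysis may define it differently.
    """
    drops = sum(
        prev - curr
        for prev, curr in zip(cumulative, cumulative[1:])
        if curr < prev
    )
    total = cumulative[-1]
    return drops / total if total > 0 else float("inf")


def reliable_countries(series_by_country, threshold=0.05):
    """Keep only the countries whose drop share does not exceed the threshold."""
    return {
        country: series
        for country, series in series_by_country.items()
        if cumulative_drop_share(series) <= threshold
    }


# Hypothetical cumulative case counts for three made-up countries.
example = {
    "A": [0, 10, 20, 30, 40],    # never decreases
    "B": [0, 10, 9, 30, 100],    # small correction: 1 out of 100
    "C": [0, 50, 20, 60, 100],   # large correction: 30 out of 100
}
kept = reliable_countries(example)
print(sorted(kept))
```

Under this reading, a country with small reporting corrections stays in the sample, while a country whose corrections are large relative to its total is dropped.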

The key focal point of the exploratory data analysis (EDA) is to explore:

Including governmental measures is an attempt to explain the difference, if one exists, between rural areas and COVID-19 death and spread.

To make the analysis more visually presentable, countries are divided into low (up to 40% rural), medium (40-70% rural), and high (70% or more rural) categories. Data quality is checked at different stages of the analysis.
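As a minimal sketch of this binning, the function below maps a rural population percentage to a category. The treatment of the boundaries (exactly 40% counts as medium, exactly 70% as high) and the label names are assumptions, since the text does not specify them.

```python
def rural_category(rural_pct):
    """Map a rural population percentage (0-100) to a category label.

    Assumed boundaries: [0, 40) is low, [40, 70) is medium, [70, 100] is high.
    """
    if not 0 <= rural_pct <= 100:
        raise ValueError("rural_pct must lie between 0 and 100")
    if rural_pct < 40:
        return "low"
    if rural_pct < 70:
        return "medium"
    return "high"


# Hypothetical rural population percentages.
for pct in (12.5, 40.0, 69.9, 83.0):
    print(pct, rural_category(pct))
```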

Analysis

Extracting data

Connecting to the database

Extracting data on COVID-19

Extracting data on Rural Areas

EDA for rural areas

Rural Distribution (Figure 2)

Correlation Matrix (Figure 3)

Temporal Overview (Figure 4)

Split by Continent (Figure 5,6)

EDA for government intervention

Extracting governmental data

Plotting government intervention timelines (Figure 7)

Plotting weekly cases by rural area and time (Figure 8)

Results

In this part, I present the results of the research by (a) overviewing the distribution of rural areas, (b) observing the relationship between rural areas and COVID-19 cases, and (c) offering possible explanations.

First, we can observe that the countries with the largest rural population shares are located in the Global South, as presented in Figure 2.

This can have important implications for the research, as the rural population share might act as a proxy for economic development. In particular, Pearson's correlation coefficient between economic development (defined by GDP per capita, adjusted for PPP) and rural population is -0.64.

The correlation between COVID-19 deaths, cases, rural population, as well as GDP and population density is provided in Figure 3.

The rural population share is moderately negatively correlated with confirmed cases per 100 thousand people (r=-0.43) and with deaths per 100 thousand people (r=-0.38). This means that countries with a more rural population tend to have fewer confirmed cases and deaths per 100 thousand people. Note that there is little correlation with deaths per case, the measure that controls for the number of cases in the population. The correlation with deaths per 100 thousand people is likely driven by cases: confirmed cases and deaths per 100 thousand people are themselves correlated at r=0.78, so deaths per capita do not properly control for cases.
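For reference, Pearson's correlation coefficient can be computed as in the minimal sketch below. The country-level values are invented purely for illustration and do not reproduce the coefficients reported in this essay.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values: rural share (%) and confirmed cases per 100 thousand.
rural_pct = [10, 30, 50, 70, 90]
cases_per_100k = [900, 700, 650, 300, 200]
r = pearson_r(rural_pct, cases_per_100k)
print(round(r, 2))
```

A negative coefficient, as in this made-up example, corresponds to countries with a larger rural share having fewer cases per capita.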

Figure 4 presents the cumulative number of cases for the different rural categories on the left, as well as their daily changes on the right. This enables us to view how the difference in COVID-19 cases between rural categories develops over time.

The population-adjusted number of COVID-19 cases has been the highest for countries with smaller rural populations during all months. Note that the daily changes were consistently higher for countries with a lower rural population too.

It is important to look at different contexts, however. It might be the case that this is not a result of rural areas per se but, rather, of economic development. One way to control for this is to separate the analysis by continent and observe whether countries in lower rural categories still have more cases within each continent. Figure 5 presents the distribution of COVID-19 cases for each category by continent.

It seems that countries in Africa, North America, and Asia have slightly more cases if they fall into lower rural categories, yet with a substantial degree of overlap. There is insufficient data for South America, while in Europe there seems to be no visual difference between the two. Figure 6 depicts similar distributions for the number of deaths.

There seems to be no difference in the peaks of the distributions in any of the continents, with the possible exception of Asia, suggesting that there is no clearly observable relationship between rural categories and COVID-19 deaths.

From the data observed so far, I reject my first hypothesis, which states that there is no relationship between COVID-19 cases and rural populations, as the visualizations do suggest that a difference exists. Likewise, I reject my second hypothesis, as there seems to be no observable difference in the number of deaths, operationalized as deaths per case.

One reason why countries with smaller rural populations have more cases could be differences in the effectiveness of government intervention. To understand whether government intervention was the key factor behind the different COVID-19 case numbers, I look into national stay-at-home policy enforcement. Figure 7 shows when governments introduced stay-at-home requirements, by rural category.

Visually, it seems that most of the interventions happened between March and April, regardless of the rural category. However, simply looking at the time of the intervention is insufficient, since the case count might have been drastically different at the same point in time. Hence, Figure 8 looks at the relative distribution of cases for COVID-19 in terms of weeks before and after the government intervention, divided by rural category.

Each box is an average of the number of new weekly cases per 100 thousand people in a given rural population bin, relative to when nation-wide stay-at-home requirements were introduced.
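The alignment of weekly cases to the intervention date can be sketched as follows. This is a minimal illustration for a single country, assuming that week 0 starts on the day the requirement was introduced; the function name, dates, and counts are hypothetical, and the averaging across the countries of a rural category is omitted.

```python
from collections import defaultdict
from datetime import date


def weekly_cases_relative_to(intervention, daily_new_cases, population):
    """Aggregate daily new cases into weeks counted from the intervention date.

    Returns {week_offset: new cases per 100 thousand people}, where week 0
    covers the intervention day and the six days after it (an assumption).
    """
    weekly = defaultdict(float)
    for day, new_cases in daily_new_cases.items():
        # Floor division keeps the seven days before the intervention in week -1.
        week_offset = (day - intervention).days // 7
        weekly[week_offset] += new_cases
    return {
        week: total * 100_000 / population
        for week, total in sorted(weekly.items())
    }


# Hypothetical country with one million inhabitants.
intervention_day = date(2020, 3, 23)
daily = {
    date(2020, 3, 16): 10,
    date(2020, 3, 22): 20,
    date(2020, 3, 23): 30,
    date(2020, 3, 29): 40,
    date(2020, 3, 30): 50,
}
weekly = weekly_cases_relative_to(intervention_day, daily, population=1_000_000)
print(weekly)
```

Averaging these per-country series within each rural category would then give one value per category and week offset.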

A clear pattern emerges: countries with lower rural populations had more cases both before and, especially, after the introduction of the requirements. The number of cases after the introduction of the requirements increased the most for countries with low rural populations and persisted for a longer time. Because the gap already existed before the requirements were introduced, this data does not support the hypothesis that effective government intervention was the reason for fewer cases in rural areas.

Overall, more analysis is required to confirm the existence of the hypothesized relationships.

Part 2. Fair and balanced? A comparison of the sentiments of two major media outlets in the context of the COVID-19 pandemic

Introduction

There is a systemic divide between Republicans and Democrats with regard to their preferred news outlet. Republicans trust Fox News and distrust CNN the most, with the opposite holding for Democrats (Jurkowitz et al., 2020). This phenomenon has important implications for information dissemination to the general public. For example, an increase in Fox News viewership is associated with non-compliance with social distancing during the pandemic (Simonov et al., 2020) and with lower consumption of virus-protective products, such as masks or hand sanitizers (Ash et al., 2020).

In this essay, I build upon this research by analyzing the sentiments of coronavirus-related articles from CNN and Fox News. The existing research suggests that the effect of Fox News on viewer behavior is partially attributed to program-specific content that minimized the COVID-19 threat (ibid.). Therefore, I hypothesize that:

H1: Fox News coronavirus-related article sentiment is significantly more positive than CNN article sentiment.

The US elections also took place during the pandemic. Mass media is an important factor that can shape election results (Graber & Dunaway, 2017). For instance, Fox News has previously shaped elections in favor of the Republican party (Martin & Yurukoglu, 2017). Given that the viewership of Fox News and CNN consists predominantly of Republicans and Democrats, respectively, I hypothesize that:

H2: Fox News articles mentioning the US president show greater positive sentiment than those which do not.

H3: CNN articles mentioning the US president show greater negative sentiment than those which do not.

Methodology

To achieve the goal of the study, all CNN and Fox News articles that contain the word 'coronavirus' and were published between 2020-05-11 and 2020-11-20 are scraped. VADER (Hutto & Gilbert, 2014), a rule-based sentiment analysis tool, is used to analyze the data. The data is then used to:

Certain repetitive phrases that do not add any organic content, such as ‘EXCLUSIVE LIVE UPDATES’, are identified through manual inspection and removed from the Fox News and CNN articles.

Lastly, I extract TF-IDF scores for all the articles and find similar articles using cosine similarity. A network is created in which each node is an article and each edge indicates that two articles are related in content and sentiment. Articles are considered to be similar if their cosine similarity is above 0.4 and the difference between their VADER sentiments is less than 0.2. This allows us to inspect whether similar articles from the same news outlet tend to cluster together and whether there are any visual differences in the sentiments portrayed by CNN and Fox News.
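The edge rule can be sketched in a few lines. The example below is a simplified, dependency-free illustration: the TF-IDF weighting (relative term frequency times a smoothed inverse document frequency) is an assumption and need not match the weighting used in the analysis, and the tokenized articles and sentiment scores are invented.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(tokenized_docs):
    """Sparse TF-IDF vectors as {term: weight} dicts (assumed weighting scheme)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_article_edges(tokenized_docs, sentiments,
                          min_cosine=0.4, max_sentiment_gap=0.2):
    """Pairs (i, j) that are similar in both content and sentiment."""
    vectors = tfidf_vectors(tokenized_docs)
    edges = []
    for i, j in combinations(range(len(vectors)), 2):
        close_in_content = cosine_similarity(vectors[i], vectors[j]) > min_cosine
        close_in_sentiment = abs(sentiments[i] - sentiments[j]) < max_sentiment_gap
        if close_in_content and close_in_sentiment:
            edges.append((i, j))
    return edges


# Hypothetical tokenized articles and compound sentiment scores.
docs = [
    ["coronavirus", "cases", "rise", "in", "texas"],
    ["coronavirus", "cases", "rise", "in", "florida"],
    ["stocks", "rally", "on", "vaccine", "hopes"],
]
compound_scores = [-0.50, -0.45, 0.60]
edges = similar_article_edges(docs, compound_scores)
print(edges)
```

Only the first two articles share enough vocabulary and have close enough sentiment scores to be connected; the third article remains an isolated node.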

Data Extraction

Scraper Class

CNN Scraper Class

Fox News Scraper

Data Analysis

Import and data quality

Tokenizing and extracting sentiments

Testing difference in sentiments

Sentiment Distribution (Figure 9)

Sentiments and COVID-19 cases over time (Figure 10)

Comparing the relationship between sentiments and COVID-19 cases and deaths

Granger causality

Difference between sentiments that mention the president

Network Analysis

Building a Graph

Visualizing Graphs

CNN and Fox network (Figure 11)

CNN and FOX network by sentiment (Figure 12)

Largest connected component (Figure 13)

Network data check

Results

A total of 1265 Fox News and 9858 CNN articles were scraped using the query 'coronavirus'. 91% of Fox News articles and about 20% of CNN articles had the query term in the title. To make the populations more comparable, only the articles with 'coronavirus' in the title were kept, leaving about 1.1 and 1.9 thousand articles for Fox News and CNN, respectively.

Sentiment analysis was performed on the corpus, and an initial t-test showed that the sentiments differ at p=0.0126, with CNN having more favorable sentiments (as measured by the compound score, a single unidimensional measure of sentiment in which a higher score represents more positive sentiment). In my first hypothesis, I argued that Fox News should have a more favorable sentiment due to its close ties to the Republican party. Consequently, I reject my first hypothesis based on the given data and conclude that Fox News does not have a more favorable news sentiment in the context of coronavirus. The sentiment distribution of CNN and Fox News articles is presented in Figure 9.

The upper blue and lower red histograms depict CNN and FOX scores, respectively. The colors for the outlets were chosen to depict a proxy for mass media partisanship and are not reflective of their brand style design.

The compound scores have a bimodal distribution; positivity and negativity are positively skewed, whereas neutrality seems to be roughly normally distributed. Most articles have a rather high score for neutrality and low scores for positivity and negativity. For the purposes of this analysis, the compound score is mostly used. Although the t-test showed a significant result, the histograms suggest that the effect sizes are small.

The aggregated media sentiment has been changing throughout the pandemic. Figure 10 illustrates the daily change in sentiment together with the daily change in COVID-19 cases.

The first graph depicts a rolling average of the daily case changes in the US relative to the previous day. The second and third graphs show the 14-day rolling averages of CNN and Fox News article sentiments over time. CNN articles seem to be more positive throughout the whole period and to fluctuate less, which is consistent with the t-test.
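A rolling average of this kind can be sketched as below. The sketch assumes a trailing window (each value averages the current day and the preceding days) and leaves the first days, for which a full window is not yet available, undefined; whether the figure uses a trailing or a centered window is not stated. The sentiment values are invented.

```python
def trailing_rolling_mean(values, window=14):
    """Trailing rolling mean; positions without a full window are None."""
    if window < 1:
        raise ValueError("window must be at least 1")
    result = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running_sum -= values[i - window]
        result.append(running_sum / window if i >= window - 1 else None)
    return result


# Hypothetical daily mean compound scores, smoothed with a 3-day window.
daily_sentiment = [1, 2, 3, 4, 5]
smoothed = trailing_rolling_mean(daily_sentiment, window=3)
print(smoothed)
```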

However, absolute sentiment alone is not necessarily reflective of the relationship between COVID-19 cases, deaths, and media sentiment; analyzing the covariance of the variables might be useful. Table 1 shows Pearson's correlation coefficients between CNN and Fox News sentiment, daily death growth, and daily case growth.

The data suggests that CNN sentiment becomes more positive as the number of COVID-19 cases increases (p<0.01); the same is true for Fox News (p<0.05). Although there could be theoretical mechanisms to justify this, I address the plausible reasons for this relationship in the limitations section. An additional way to test the relationship (especially for CNN, given the small p-value) is to use Granger causality. If there is any theoretical link between positive sentiment and an increase in COVID-19 cases, the relationship should hold only if the sentiment follows the cases, and not vice versa. A Granger causality test can indicate whether this is the case.

The Granger causality test results indicate that COVID-19 cases Granger-cause CNN sentiment at p=0.0016. This can be interpreted as the positive sentiment of CNN articles increasing after COVID-19 cases increase. However, note that the stationarity assumption of the Granger causality test was not met, and the linearity assumption should be further validated. A more robust approach is necessary to confirm this claim.

Furthermore, in the introduction, it was hypothesized that articles that mention the president, defined as having the words 'Trump' or 'president' in the text, are more positive for Fox News and more negative for CNN. A comparison is made, within CNN and Fox News separately, between articles that do and do not mention the president. The table with the sentiments is presented below.
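The definition of a president-mentioning article can be sketched as a simple text match. The case-insensitive, whole-word matching used here is an assumption (it deliberately does not match 'presidential', for instance), and the example sentences are invented.

```python
import re

# Assumption: whole-word, case-insensitive match on 'Trump' or 'president'.
PRESIDENT_PATTERN = re.compile(r"\b(trump|president)\b", re.IGNORECASE)


def mentions_president(text):
    """True if the article text contains 'Trump' or 'president' as a word."""
    return PRESIDENT_PATTERN.search(text) is not None


# Hypothetical article snippets.
snippets = [
    "Trump's briefing on coronavirus",
    "The President said the virus would fade",
    "Coronavirus overshadows the presidential debate",
    "Coronavirus cases rise in several states",
]
flags = [mentions_president(snippet) for snippet in snippets]
print(flags)
```

A looser substring match would also flag words such as 'presidential' or 'vice-presidency'; which variant is appropriate depends on how broadly a mention of the president should be defined.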

Differences in positive, negative, and compound scores were measured between articles that mention the president and those that do not. The columns outlet_sentiment_pres and outlet_sentiment_nopres show the sentiment for articles that do and do not mention the president for each sentiment group; the columns p_value_pres and p_value_nopres show the p-value for the difference in the two sentiment values; significant_pres and significant_no_pres label whether they are significant.

Given the large number of t-tests conducted, any relationship might have occurred simply by chance; thus, a Bonferroni correction is applied. The only significant relationship after applying the correction is that of Fox News negative sentiment. Fox News tends to be significantly less negative in articles that do not mention the president. No relationship is observed for CNN with the given parameters. Thus, I reject my second and third hypotheses that Fox News tends to be more positive and CNN more negative in articles that mention the president.
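The Bonferroni correction divides the significance level by the number of tests conducted. A minimal sketch follows; the test names, the p-values, and the significance level of 0.05 are hypothetical and are not the values from the table.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose p-value stays below alpha divided by the number of tests."""
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical p-values from three t-tests.
p_values = {"test_a": 0.001, "test_b": 0.02, "test_c": 0.4}
flags_after_correction = bonferroni_significant(p_values)
print(flags_after_correction)
```

In this made-up example, the second test would be significant at the uncorrected 0.05 level but not at the corrected threshold of 0.05 / 3.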

Moreover, to understand whether there is any visual difference in sentiment between the outlets' COVID-19 articles, I build a network of articles, where each node is an article and each edge indicates that two articles are similar in content and sentiment.

Blue and red nodes represent CNN and Fox News articles, respectively. A Kamada-Kawai layout is used to position the nodes in the network. The whole graph seems to be dominated by CNN articles, while the majority of articles are clustered in a circle in which red and blue nodes overlap. Figure 12 depicts the sentiment distribution of these articles.

The color scheme represents the compound score provided by the VADER sentiment analyzer. The network represents two different clusters of sentiments. The nodes located at the top have a negative score and the nodes at the bottom left are more positive. So far, this graph tells us that there are two clusters of CNN articles, one with positive, and one with negative sentiments, and there is no visual relationship between outlet and sentiment in the circle that has overlapping FOX and CNN articles. There appears to be no visual difference in the sentiments of the two media outlets. Figure 13 zooms into the largest connected component that also contains the two CNN clusters.

A Fruchterman-Reingold force-directed algorithm was used to position the nodes. The color represents the media outlet, and the size of each node reflects its degree centrality. Note that the red nodes do not have high degree centrality scores, showing that, in this graph, few Fox News articles are related in content and sentiment. CNN articles tend to be more closely related in sentiment and content.
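Degree centrality is the number of edges attached to a node divided by the maximum possible number, n - 1. A minimal sketch for an undirected graph without self-loops follows; the node names and edges are hypothetical.

```python
def degree_centrality(nodes, edges):
    """Degree of each node divided by (n - 1), for an undirected simple graph."""
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(nodes)
    if n < 2:
        return {node: 0.0 for node in nodes}
    return {node: d / (n - 1) for node, d in degree.items()}


# Hypothetical article network: three connected CNN articles, one Fox News article.
nodes = ["cnn_1", "cnn_2", "cnn_3", "fox_1"]
edges = [
    ("cnn_1", "cnn_2"),
    ("cnn_1", "cnn_3"),
    ("cnn_2", "cnn_3"),
    ("cnn_1", "fox_1"),
]
centrality = degree_centrality(nodes, edges)
print(centrality)
```

A node that is similar to many other articles, such as cnn_1 here, receives a score close to 1 and would be drawn as a large node.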

One of the main reasons why the node layout looks like this, however, is that the largest CNN cluster forms a clique of 118 nodes that are daily coronavirus summary articles. These CNN articles do not produce new content but summarize the preceding couple of days, artificially creating nodes that are similar in content and sentiment and making it harder to see any relationships. For future analysis, removing such nodes could be considered.

Limitations

Limitations Part 1

The study on rural areas has several important limitations. First, a whole host of important explanatory variables are not taken into account. For instance, Nguimkeu and Tadadjeu (2020) have found that geographic and demographic characteristics, which are not considered in my essay, impact COVID-19 spread in Africa. Second, the study does not control for country-level urban-rural divisions. It would be more accurate to look at the COVID-19 spread within the rural and urban populations of a given country, which might give substantially different results (and could even result in Simpson's paradox). This would also resolve the issue of rural areas being proxies for the economic development of a country. Third, rural areas are not defined homogeneously by different countries, making it hard to know the true distribution of rural areas. Fourth, there were data reliability issues with the OXCOVID19 database, such as declining cumulative numbers.

Limitations Part 2

There are significant limitations to the analysis and, hence, to the conclusions reached.

References

Ash, E., Galletta, S., Hangartner, D., Margalit, Y., & Pinna, M. (2020). The Effect of Fox News on Health Behavior During COVID-19. SSRN Electronic Journal, 1–55. https://doi.org/10.2139/ssrn.3636762

Badr, H. S., Du, H., Marshall, M., Dong, E., Squire, M. M., & Gardner, L. M. (2020). Association between mobility patterns and COVID-19 transmission in the USA: a mathematical modelling study. The Lancet Infectious Diseases, 20(11), 1247–1254. https://doi.org/10.1016/S1473-3099(20)30553-3

Chaudhuri, S., & Roy, M. (2017). Rural-urban spatial inequality in water and sanitation facilities in India: A cross-sectional study from household to national level. Applied Geography, 85, 27–38. https://doi.org/10.1016/j.apgeog.2017.05.003

DEFRA. (2018). Statistical Digest of Rural England. www.gov.uk/defra

Graber, D., & Dunaway, J. (2017). Mass Media and American Politics (10th ed.). Thousand Oaks, California: CQ Press, an imprint of SAGE Publications, Inc. https://us.sagepub.com/en-us/nam/mass-media-and-american-politics/book248903

Hutto, C. J., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM 2014.

Jurkowitz, M., Mitchell, A., Shearer, E., & Walker, M. (2020). U.S. Media Polarization and the 2020 Election: A Nation Divided. Pew Research Center. https://www.journalism.org/2020/01/24/u-s-media-polarization-and-the-2020-election-a-nation-divided/

Martin, G. J., & Yurukoglu, A. (2017). Bias in cable news: Persuasion and polarization. American Economic Review. https://doi.org/10.1257/aer.20160812

Nguimkeu, P., & Tadadjeu, S. (2020). Why is the number of COVID-19 cases lower than expected in Sub-Saharan Africa? A cross-sectional analysis of the role of demographic and geographic factors. World Development. https://doi.org/10.1016/j.worlddev.2020.105251

NRCN. (2015). The True Cost of Crime in Rural Areas. https://www.nationalruralcrimenetwork.net/content/uploads/2015/09/NRCN-National-Rural-Crime-Sur...pdf

OECD. (2015). Health at a Glance 2015: OECD Indicators. http://dx.doi.org/10.1787/health_glance-2015-en

Palmer, B., Appleby, J., & Spencer, J. (2019). Rural health care: A rapid review of the impact of rurality on the costs of delivering health care. https://www.nuffieldtrust.org.uk/files/2019-01/rural-health-care-report-web3.pdf

Pew Research Center. (2019). Digital gap between rural and nonrural America persists. https://www.pewresearch.org/fact-tank/2019/05/31/digital-gap-between-rural-and-nonrural-america-persists/

Salim, A. (2013). Management Information in Rural Area: A Case Study of Rancasalak Village in Garut, Indonesia. The 4th International Conference on Electrical Engineering and Informatics (ICEEI 2013), 11, 243–249. https://doi.org/10.1016/j.protcy.2013.12.187

Simonov, A., Sacher, S. K., Dubé, J.-P. H., Biswas, S., Petrova, M., Prat, A., Rao, A., Toubia, O., Ursu, R., & Yurukoglu, A. (2020). The Persuasive Effect of Fox News: Non-Compliance with Social Distancing During the Covid-19 Pandemic. http://www.nber.org/papers/w27237

Spasojevic, N., Vasilj, I., Hrabac, B., & Celik, D. (2015). Rural - Urban Differences in Health Care Quality Assessment. Materia Socio Medica, 27(6), 409. https://doi.org/10.5455/msm.2015.27.409-411

WHO. (2020). Water, Sanitation, Hygiene, and Waste Management for SARS-CoV-2, the Virus that Causes COVID-19. https://www.who.int/publications/i/item/WHO-2019-nCoV-IPC-WASH-2020.4

World Bank. (2020, October 26). Rural population | Data Catalog. https://datacatalog.worldbank.org/rural-population